Techscape 2021 - Machine Learning

Team members - Group 20

Alice Vale r20181074

Eva Ferrer r20181110

Rafael Sequeira r20181128

Raquel Sousa r20181102

Rogério Paulo m20210597

Import libraries

Import data

Data Exploration

Data pre-processing

Incoherence check

Outliers inspection

Feature engineering

Exploration

Data exploration after new features

Create dataset for modeling

Principal Components Analysis

Feature Selection

Subsets to test

Preliminary assessment of the performance of subsets created

Conclusion:

Note: the tests were run on standardized data with outliers removed, since this version of the data generally works best across algorithms, although there may be exceptions
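As an illustration of the kind of data described in the note above, the following is a minimal sketch of removing outliers and then standardizing. The 1.5 * IQR rule and the synthetic array are assumptions made for the example; the notebook's actual outlier criteria are not shown in this outline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric features (assumption, not the project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:5] += 50  # push the first five rows far away so they act as outliers

# Hypothetical outlier rule: keep rows within 1.5 * IQR on every feature.
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1
mask = np.all((X >= q1 - 1.5 * iqr) & (X <= q3 + 1.5 * iqr), axis=1)
X_clean = X[mask]

# Standardize the remaining rows to zero mean and unit variance per feature.
X_scaled = StandardScaler().fit_transform(X_clean)
```

In a train/validation setting the scaler would normally be fitted on the training rows only and then applied to the validation rows.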

Predictive modeling stage

Logistic Regression

Models tested:

Best model:

Gaussian Naive Bayes

Models tested:

Best model:

K-Nearest Neighbors

Adapted function to calculate optimal number of neighbors and f1-score
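A function of this kind could look roughly like the sketch below: it fits one classifier per candidate number of neighbors and keeps the value with the highest validation F1-score. The name `best_k_by_f1`, the candidate range and the synthetic data are assumptions for illustration, not the function used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def best_k_by_f1(X_train, y_train, X_val, y_val, k_values):
    """Return (best_k, best_f1, scores), where scores maps k to validation F1."""
    scores = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        scores[k] = f1_score(y_val, model.predict(X_val))
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k], scores


# Synthetic, imbalanced binary data standing in for one of the subsets.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Odd values of k avoid ties in the majority vote for a binary target.
best_k, best_f1, scores = best_k_by_f1(
    X_train, y_train, X_val, y_val, range(1, 21, 2)
)
```

Keeping the full `scores` dictionary makes it easy to plot F1 against k and check whether the chosen value sits on a stable plateau or on a single noisy peak.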

Models tested:

Best model:

Decision trees

A special parameter: ccp_alpha
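In scikit-learn, `ccp_alpha` controls minimal cost-complexity pruning: larger values prune more of the tree. One common way to choose it, sketched below on synthetic data, is to take the candidate values from `cost_complexity_pruning_path` and compare validation F1-scores. The data, split and selection rule are assumptions; the notebook's own search is not shown in this outline.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate alphas come from the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Guard against tiny negative values caused by floating-point error.
alphas = [max(float(a), 0.0) for a in path.ccp_alphas]

results = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = f1_score(y_val, tree.predict(X_val))
    results.append((score, alpha, tree.get_n_leaves()))

# Tuples compare by F1 first; on ties the larger alpha (smaller tree) wins.
best_f1, best_alpha, best_leaves = max(results)
```

The number of leaves is stored alongside each score so the trade-off between tree size and validation performance can be inspected directly.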

Best model:

Neural Networks

Parameter search space

Models tested:

Best model:

Passive-Aggressive Classifier

Models tested:

Best model:

Quadratic Discriminant Analysis

Used only for Subset4 and Subset5, since the objective was to combine it with the ensemble later on

Parameter search space:

Models tested:

Best model:

Support Vector Machine

Note: the parameters were tuned with the same method used in the practical classes; the tuning code was excluded to keep the notebook light

Models tested:

Best model:

Bagging

Best model:

Random Forest Classifier

Best model:

AdaBoost Classifier

The optimal parameters were found using the same method as in the practical classes; the search is excluded here to improve readability

Models tested:

Best model:

Gradient Boosting Classifier

Models tested:

Best model:

Stacking Classifier

For Subset4 with robust scaling without outliers

For Subset5 with robust scaling with outliers

For Subset4 with standard scaling with outliers

Voting Classifier

For Subset4 with standard scaling with outliers

Adjusting the weights given to each classifier
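One way to adjust the weights is a small grid search over weight combinations, scoring each `VotingClassifier` on a validation set. The sketch below assumes soft voting, three arbitrary base classifiers and synthetic data; none of these choices is taken from the notebook.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative base classifiers (assumption); VotingClassifier clones them on fit.
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Try every combination of weights 1 or 2 for the three classifiers.
scores = {}
for weights in product([1, 2], repeat=len(estimators)):
    voter = VotingClassifier(estimators, voting="soft", weights=list(weights))
    voter.fit(X_train, y_train)
    scores[weights] = f1_score(y_val, voter.predict(X_val))

best_weights = max(scores, key=scores.get)
best_f1 = scores[best_weights]
```

Only the ratios between weights matter, so `(1, 1, 1)` and `(2, 2, 2)` describe the same ensemble; a finer grid is mostly useful when one classifier is clearly stronger than the others.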

For Subset5 with robust scaling with outliers

Multi-modal classification

BounceRate = 0

Tested with and without SMOTE. Only the best variant is presented here: with SMOTE

BounceRate > 0

Tested with and without SMOTE. Only the best variant is presented here: with SMOTE
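The idea of training one model per BounceRate segment can be sketched as follows: rows are split on `BounceRate == 0` versus `BounceRate > 0`, a separate classifier is fitted on each part, and new rows are routed to the matching model. The synthetic data, the logistic regression models and the helper `predict_by_segment` are assumptions for illustration. The oversampling step is only marked with a comment, to keep the sketch free of the imbalanced-learn dependency.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: column 0 plays the role of BounceRate (assumption).
rng = np.random.default_rng(0)
n = 600
bounce = np.where(rng.random(n) < 0.4, 0.0, 0.01 + rng.random(n) * 0.2)
other = rng.normal(size=(n, 3))
X = np.column_stack([bounce, other])
y = (other[:, 0] + rng.normal(scale=0.5, size=n) > 0.8).astype(int)

zero = X[:, 0] == 0

# Fit one model per segment.
models = {}
for name, mask in (("zero", zero), ("positive", ~zero)):
    X_seg, y_seg = X[mask], y[mask]
    # An oversampler such as imbalanced-learn's SMOTE would be applied to
    # (X_seg, y_seg) here, on training rows only.
    models[name] = LogisticRegression(max_iter=1000).fit(X_seg, y_seg)


def predict_by_segment(X_new):
    """Route each row to the model trained on its BounceRate segment."""
    X_new = np.asarray(X_new)
    out = np.empty(len(X_new), dtype=int)
    is_zero = X_new[:, 0] == 0
    for name, mask in (("zero", is_zero), ("positive", ~is_zero)):
        if mask.any():
            out[mask] = models[name].predict(X_new[mask])
    return out


preds = predict_by_segment(X)
```

Routing at prediction time must use exactly the same rule as the split at training time, otherwise rows end up scored by a model that never saw their segment.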